Abstract

Feature Learning aims to extract relevant information contained in data sets in 
an automated fashion. It is the driving force behind the current deep learning trend, 
a set of methods that have had widespread empirical success. What is lacking 
is a theoretical understanding of different feature learning schemes. This work 
provides a theoretical framework for feature learning and then characterizes when 
features can be learnt in an unsupervised fashion. We also provide means to judge 
the quality of features via rate-distortion theory and its generalizations. 


1 Introduction 


Machine learning methods are only as good as the features they learn from. This simple observation 
has led to a plethora of feature learning methods: methods that aim to learn features and 
a linear classifier in one go, such as neural networks and predictive sparse coding; 
methods based on conditional independence tests; unsupervised feature learning 
methods [26]; and, of course, good old-fashioned hand-engineered features. While there 
exist many heuristic justifications for these methods, what is lacking is a general theory of feature 
learning. 


[Figure: flow chart of the learning pipeline; the recoverable labels are "data" and "Feature Map"]

We are all familiar with the above flow chart. Many methods exist for each of the above components. 
For a real application we are interested in measuring the predictive performance of the combined 
system. For the sake of understanding we seek means to measure the quality of each component. 
Thus we seek a measure of the quality of a feature map that is independent from the rest of the overall 
system, as well as a means to combine this with the generalization performance of a classification 
algorithm to provide bounds on the overall performance of the entire system. 


To this end we review both supervised and unsupervised feature learning schemes, presenting a 
novel supervised feature learning algorithm as well as novel results on the transfer of regret bounds. We 
draw inspiration from both rate-distortion theory and the comparison of statistical experiments. 
We provide, to our knowledge, the first framework from which to understand feature 
learning, as well as a characterization (Theorem 5) of when unsupervised feature learning is possible 
within our framework. Our characterization metrizes feature learning, in the sense that we give a 
means to calculate the amount of information lost by any feature map. We show how many existing 
schemes for feature learning can be understood as surrogates to Theorem 5. Finally, we show how 
rate-distortion theory can be used to rank the quality of features. 


2 Notation and Preliminaries 

Throughout the paper $Y$, $X$, $Z$ and $A$ will denote the label, instance, feature and action spaces 
respectively. We allow $A$ to be arbitrary so as to include both classification and conditional probability 
estimation, amongst others. $L$ will denote a loss function $L : Y \times A \to \mathbb{R}_+$. Denote by 
$\|L\| = \sup_{y,a} |L(y, a)|$ the norm of the loss. Furthermore, for two sets $X$ and $Y$, the set of all 
functions $f : X \to Y$ will be denoted by $Y^X$. 

For a set $X$ denote the set of probability distributions on $X$ by $\mathcal{P}(X)$. Denote by $\|P - Q\|$ the 
variational divergence between $P$ and $Q$ [22], a standard metric on probability distributions. 

Define a Markov kernel from a set $X$ to a set $Y$ to be a measurable function $P_{Y|X} : X \to \mathcal{P}(Y)$, 
in the sense that for all measurable $f : Y \to \mathbb{R}$, $f^*(x) = \mathbb{E}_{y \sim P_{Y|X}(x)} f(y)$ is a measurable 
function. Markov kernels provide a means to work with conditional probability distributions. As 
shorthand, $P_{Y|X}(x) = P_{Y|x}$. All measurable functions $f : X \to Y$ define Markov kernels, with 
$P_{Y|x} = \delta_{f(x)}$. Denote by $M(X, Y)$ the set of all Markov kernels from $X$ to $Y$. 

Given two Markov kernels $P_{X|Y}$ and $P_{Z|X}$ we can compose them to form $P_{Z|X} \circ P_{X|Y} : Y \to \mathcal{P}(Z)$, 
essentially by marginalizing out $X$ in the Markov chain $Y \to X \to Z$ [25][19]. One has 

$$\mathbb{E}_{P_{Z|X} \circ P_{X|y}} f = \mathbb{E}_{x \sim P_{X|y}} \mathbb{E}_{z \sim P_{Z|x}} f(z)$$

for all measurable $f : Z \to \mathbb{R}$. Given a Markov kernel $P_{Y|X}$ and a distribution $P_X \in \mathcal{P}(X)$ we can 
form a joint distribution $P_{XY} = P_X \otimes P_{Y|X}$ in the standard way. Similarly, by Bayes' rule we have 
$P_{XY} = P_X \otimes P_{Y|X} = P_Y \otimes P_{X|Y}$. Such a "disintegration" holds for very general measure spaces. 
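For finite spaces, kernel composition reduces to a matrix product. The sketch below is only an illustration under stated assumptions: kernels are stored as row-stochastic nested lists (row = conditioning value), and the names `compose`, `p_x_given_y` and `p_z_given_x` as well as the numbers are hypothetical, not taken from the paper.

```python
def compose(second, first):
    """Return the kernel `second o first` for row-stochastic matrices.

    first[a][b] is the probability of b given a, second[b][c] the probability
    of c given b; the result marginalizes out b.
    """
    n_c = len(second[0])
    return [
        [sum(row[b] * second[b][c] for b in range(len(row))) for c in range(n_c)]
        for row in first
    ]

# Illustrative experiment P_{X|Y} with |Y| = 2, |X| = 3.
p_x_given_y = [[0.7, 0.2, 0.1],
               [0.1, 0.3, 0.6]]
# Illustrative feature map P_{Z|X} with |X| = 3, |Z| = 2.
p_z_given_x = [[1.0, 0.0],
               [0.5, 0.5],
               [0.0, 1.0]]

# P_{Z|Y} = P_{Z|X} o P_{X|Y}, indexed [y][z].
p_z_given_y = compose(p_z_given_x, p_x_given_y)
```

Each row of `p_z_given_y` is again a probability vector, which is the finite analogue of the statement that the composition of two Markov kernels is a Markov kernel.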

We assume that learning follows the protocol: First, nature draws $(x, y) \sim P_{XY}$. Second, the learner 
observes $x$ and chooses an action $a$. Finally, the learner incurs loss $L(y, a)$. We view the loss 
function as an integral part of the learning problem. We place no restrictions on its form. We refer 
to $P_{X|Y}$, the class conditional distributions, as the experiment. 

Let $P \in \mathcal{P}(Y)$ and $L$ be a loss. Define the Bayes act $a_P := \arg\inf_a \mathbb{E}_{y \sim P} L(y, a)$. If multiple Bayes 
acts exist then pick one of them. In many cases of interest there is a unique Bayes act. This 
is true for all strictly proper losses [22][21] as well as kernel mean based losses. As shorthand, 
$L(y, P) = L(y, a_P)$. Define the Bayes risk by $\underline{L}(P) := \inf_a \mathbb{E}_{y \sim P} L(y, a) = \mathbb{E}_{y \sim P} L(y, P)$. 
Similarly, for any loss function define the regret 

$$D_L(P, Q) := \mathbb{E}_{y \sim P} L(y, Q) - \mathbb{E}_{y \sim P} L(y, P).$$

The regret measures how suboptimal the best action for distribution $Q$ is when played against 
distribution $P$. The regret is always non-negative and equal to zero if $P = Q$. 
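The definitions of Bayes act, Bayes risk and regret can be made concrete for finite label and action sets. The following is a minimal sketch, assuming the loss is given as a table `loss[y][a]` and ties between Bayes acts are broken by taking the lowest index; the function names and the example distributions are illustrative choices, not part of the original text.

```python
def expected_loss(p, loss, a):
    """E_{y ~ p} L(y, a) for a finite distribution p and loss table loss[y][a]."""
    return sum(p[y] * loss[y][a] for y in range(len(p)))

def bayes_act(p, loss):
    """An action minimizing the expected loss under p (lowest index on ties)."""
    return min(range(len(loss[0])), key=lambda a: expected_loss(p, loss, a))

def bayes_risk(p, loss):
    """The minimum expected loss under p."""
    return expected_loss(p, loss, bayes_act(p, loss))

def regret(p, q, loss):
    """D_L(p, q): loss of playing the Bayes act for q against p, minus the Bayes risk of p."""
    return expected_loss(p, loss, bayes_act(q, loss)) - bayes_risk(p, loss)

zero_one = [[0.0, 1.0],   # y = 0: action 0 is correct
            [1.0, 0.0]]   # y = 1: action 1 is correct
p = [0.8, 0.2]
q = [0.3, 0.7]
regret_pq = regret(p, q, zero_one)   # acting for q when the truth is p
regret_pp = regret(p, p, zero_one)   # zero, as P = Q
```

Here the Bayes act for `q` is action 1, which costs 0.8 in expectation under `p`, while the Bayes risk of `p` is 0.2, so the regret is 0.6.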

In supervised learning, one assumes a fixed but unknown distribution $P_{XY} \in \mathcal{P}(X \times Y)$ over 
instance-label pairs. One wishes to find a function $f \in A^X$ that chooses a suitable action upon 
observing a given instance. Ideally $f$ should minimize the risk $R_L(P_{XY}, f) := \mathbb{E}_{(x,y) \sim P_{XY}} L(y, f(x))$. 
If we allow randomized functions, i.e. Markov kernels $P_{A|X} : X \to \mathcal{P}(A)$, then we can extend the 
definition of risk to $R_L(P_{XY}, P_{A|X}) := \mathbb{E}_{(x,y) \sim P_{XY}} \mathbb{E}_{a \sim P_{A|x}} L(y, a)$. For the purpose of finding minimum 
risks, randomization does not help. Denote by 

$$R_L(P_{XY}) = \inf_{f \in A^X} R_L(P_{XY}, f)$$

$$f_{P_{XY}} = \arg\inf_{f \in A^X} R_L(P_{XY}, f)$$

the minimum risk and Bayes optimal function respectively. By standard manipulations 

$$R_L(P_{XY}) = \mathbb{E}_{x \sim P_X} \underline{L}(P_{Y|x}) = \mathbb{E}_{(x,y) \sim P_{XY}} L(y, P_{Y|x})$$

and $f_{P_{XY}}(x) = \arg\inf_a \mathbb{E}_{y \sim P_{Y|x}} L(y, a)$, where $P_{Y|X}$ is the Markov kernel obtained from applying 
Bayes' rule to $P_Y \otimes P_{X|Y}$. In practice one is normally restricted to $f$ in some function class and only 
has a sample of $n$ iid draws from $P_{XY}$ from which to learn. As our focus here is on "preserving 
the information" in $P_{XY}$, we shall in large part avoid such concerns. 

3 Supervised Feature Learning/ Loss and Experiment Specific Features 

For a multitude of reasons, including but not limited to computation, storage, the curse of dimensionality, 
increased classification performance and knowledge discovery, we may wish to process 
the instances through a (possibly randomized) feature map $P_{Z|X}$. For a given feature map, learning 
follows the protocol: First, nature draws $(x, y) \sim P_{XY}$. Second, the learner observes $z \sim P_{Z|x}$ and 
chooses an action $a$. Finally, the learner incurs loss $L(y, a)$. Diagrammatically, 


[Diagram: data $(x, y) \sim P_{XY}$ is passed through the feature map $P_{Z|X}$ before the learner acts]




By using the feature map we move from $P_{XY}$ to $P_{ZY}$ with 

$$\mathbb{E}_{P_{ZY}} f = \mathbb{E}_{(x,y) \sim P_{XY}} \mathbb{E}_{z \sim P_{Z|x}} f(z, y)$$

for all measurable $f : Z \times Y \to \mathbb{R}$. Hence $P_{ZY} = P_Y \otimes (P_{Z|X} \circ P_{X|Y})$. Ideally $P_{ZY}$ should 
contain just as much "information" as $P_{XY}$, in the sense that the feature gap 

$$\Delta R_L(P_{XY}, P_{Z|X}) := R_L(P_{ZY}) - R_L(P_{XY})$$

should be small. To be clear, $R_L(P_{ZY}) = \inf_{f \in A^Z} R_L(P_{ZY}, f)$, i.e. we are restricted to functions 
that only use the features. 

Theorem 1. For all joint distributions $P_{XY}$, feature maps $P_{Z|X}$ and loss functions $L$, 

$$\Delta R_L(P_{XY}, P_{Z|X}) = \mathbb{E}_{(x,z) \sim P_{XZ}} D_L(P_{Y|x}, P_{Y|z}).$$


For proof see additional material. Hence the feature gap is the average regret suffered in using 
features versus raw instances when acting optimally for both. In particular this means the feature 
gap is always non-negative. 
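The identity between the feature gap and the average regret can be checked numerically on a small finite problem. The sketch below assumes finite $X$, $Y$, $Z$, row-stochastic kernels stored as nested lists, 0-1 loss, and that every feature value has positive probability; all names and numbers are illustrative and not taken from the paper.

```python
def expected_loss(p, loss, a):
    return sum(p[y] * loss[y][a] for y in range(len(p)))

def bayes_act(p, loss):
    return min(range(len(loss[0])), key=lambda a: expected_loss(p, loss, a))

def bayes_risk(p, loss):
    return expected_loss(p, loss, bayes_act(p, loss))

def regret(p, q, loss):
    return expected_loss(p, loss, bayes_act(q, loss)) - bayes_risk(p, loss)

def feature_gap_two_ways(p_x, p_y_given_x, p_z_given_x, loss):
    """Return (R_L(P_ZY) - R_L(P_XY), E_{(x,z)} D_L(P_{Y|x}, P_{Y|z}))."""
    n_x, n_z, n_y = len(p_x), len(p_z_given_x[0]), len(p_y_given_x[0])
    # Marginal P_Z and posterior P_{Y|z}, using that Y -> X -> Z is a Markov chain.
    p_z = [sum(p_x[x] * p_z_given_x[x][z] for x in range(n_x)) for z in range(n_z)]
    p_y_given_z = [
        [sum(p_x[x] * p_z_given_x[x][z] * p_y_given_x[x][y] for x in range(n_x)) / p_z[z]
         for y in range(n_y)]
        for z in range(n_z)
    ]
    risk_x = sum(p_x[x] * bayes_risk(p_y_given_x[x], loss) for x in range(n_x))
    risk_z = sum(p_z[z] * bayes_risk(p_y_given_z[z], loss) for z in range(n_z))
    avg_regret = sum(
        p_x[x] * p_z_given_x[x][z] * regret(p_y_given_x[x], p_y_given_z[z], loss)
        for x in range(n_x) for z in range(n_z)
    )
    return risk_z - risk_x, avg_regret

p_x = [0.5, 0.3, 0.2]
p_y_given_x = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]]
p_z_given_x = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # merges the first two instances
zero_one = [[0.0, 1.0], [1.0, 0.0]]

gap, avg_regret = feature_gap_two_ways(p_x, p_y_given_x, p_z_given_x, zero_one)
```

In this example merging the first two instances forces the learner to predict label 0 on the second instance, which costs 0.2 in regret on a set of probability 0.3, so both quantities equal 0.06.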


3.1 Link to Sufficiency and Conditional Independence 

The feature gap is closely related to the statistical notions of sufficiency and conditional independence. 
In particular we have the following theorem. 

Theorem 2 (Blackwell-Sherman-Stein [25]). $\Delta R_L(P_{XY}, P_{Z|X}) = 0$ for all loss functions if and 
only if $X$ and $Y$ are conditionally independent given $Z$. 

In fact the Blackwell-Sherman-Stein theorem is even stronger: if we are also allowed to change the 
prior $P_Y$ on labels as well as the loss, and the feature gap remains zero, then $Z$ is sufficient for $X$. 
$Z$ then contains all the useful information in $X$ for predicting $Y$, in both the average risk 
and minimax sense. If $Y$ is finite, then the vector of likelihood ratios $\left(\ldots, \frac{dP_{X|y_i}}{dP_{X|y_n}}, \ldots\right)$ 
is always sufficient for $X$. This in turn means the Bayesian posterior distribution is also sufficient 
for priors that do not assign zero mass to any label. In many cases we can do better and find $Z$ sufficient 
for $X$, or close to it, with $Z$ of lower dimension than $|Y|$, or even with $Z$ finite. 

This observation has led to several classes of algorithms for supervised feature learning. One picks 
a loss with the property that $D_L(P, Q) = 0$ iff $P = Q$ and then uses this loss as a surrogate for 
testing sufficiency by finding 

$$\arg\inf_{P_{Z|X}} \Delta R_L(P_{XY}, P_{Z|X}).$$

Of course if $\inf_{P_{Z|X}} \Delta R_L(P_{XY}, P_{Z|X}) = 0$ for one of these surrogates then, by Theorem 1 and as 
regret is always non-negative, $P_{Y|x} = P_{Y|z}$ almost surely and so the feature gap will be zero for all losses. Some common surrogates 
include log loss, with $D_L(P, Q) = D_{KL}(P, Q)$, which leads to the information bottleneck [24]. 
More general Bregman divergences lead to clustering with Bregman divergences. Finally, kernel 
mean based losses $L : Y \times \mathcal{H} \to \mathbb{R}$ with $\mathcal{H}$ a Hilbert space can also be used. Taking $\phi : Y \to \mathcal{H}$ 
and $L(y, v) = \|\phi(y) - v\|^2_{\mathcal{H}}$ with $\phi$ characteristic yields another suitable surrogate [27]. In this 
case $D_L(P, Q) = \|\mu_P - \mu_Q\|^2_{\mathcal{H}}$, the squared distance between the kernel means of $P$ and $Q$. 
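As a short worked derivation of the log loss case, assume $Y$ is finite, take the action space to be $\mathcal{P}(Y)$ and let $L(y, Q) = -\log Q(y)$. The Bayes act for $P$ is then $P$ itself (by Gibbs' inequality), and the regret is the KL divergence:

```latex
\begin{aligned}
D_L(P, Q) &= \mathbb{E}_{y \sim P}\left[-\log Q(y)\right] - \mathbb{E}_{y \sim P}\left[-\log P(y)\right] \\
          &= \sum_{y \in Y} P(y) \log \frac{P(y)}{Q(y)} \\
          &= D_{KL}(P, Q).
\end{aligned}
```

Since $D_{KL}(P, Q) = 0$ iff $P = Q$, log loss satisfies the property required of a surrogate.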

For all the previous cases, algorithms exist for performing the minimization. These include alternating 
algorithms much like the Blahut-Arimoto algorithm of rate-distortion theory in the first two 
cases, with something a bit more involved in the third (although it is restricted to linear, deterministic 
feature maps). 

In practice, one might not know the exact loss function to use. Hence care must be taken in choosing 
a suitable surrogate or set of surrogates. We show in the examples section that the loss function can 
greatly influence how we rank features. This should be of no surprise, as the loss function defines 
the relevant information contained in $P_{XY}$. 

3.2 Link to Deficiency 

If the loss is not known one can perform a worst case analysis 

$$\sup_{L, \|L\| \le 1} \Delta R_L(P_{XY}, P_{Z|X}).$$

Worst case differences in risk as the loss is varied have been studied extensively in the subfield of 
theoretical statistics known as the comparison of statistical experiments. In this area the 
focus is placed on the experiments $P_{X|Y}$ and $P_{Z|Y}$. 

Definition 3. Let $P_{X|Y}$ and $P_{Z|Y}$ be experiments on $Y$, and $P_Y$ a distribution on $Y$. The weighted 
directed deficiency from $P_{Z|Y}$ to $P_{X|Y}$ is equal to 

$$\delta_{P_Y}(P_{Z|Y}, P_{X|Y}) = \inf_{P_{X|Z}} \mathbb{E}_{y \sim P_Y} \|P_{X|y} - P_{X|Z} \circ P_{Z|y}\|.$$

The weighted directed deficiency measures how close we can make $P_{Z|Y}$ to $P_{X|Y}$, in the sense 
of variational divergence, by adding extra noise $P_{X|Z}$. It is closely related to approximate notions 
of sufficiency, and it appears in an approximate version of the Blackwell-Sherman-Stein 
theorem. 

Theorem 4 (Randomization [25]). For all $P_Y \in \mathcal{P}(Y)$ and for all experiments $P_{X|Y}$ and $P_{Z|Y}$, 
$\Delta R_L(P_{XY}, P_{Z|X}) \le \epsilon \|L\|$ for all loss functions $L$ if and only if $\delta_{P_Y}(P_{Z|Y}, P_{X|Y}) \le \epsilon$. 

This theorem suggests a means to construct features when the loss function is not known, by minimizing 
the weighted directed deficiency. While this may appear difficult, one can exploit properties 
of the variational divergence that make calculating the weighted directed deficiency an $L_1$ minimization 
problem (see additional material). As long as the sets $X$, $Y$ and $Z$ are finite, fast methods exist 
to solve this problem. One can obtain features by finding 

$$\inf_{P_{Z|X}, P_{X|Z}} \mathbb{E}_{y \sim P_Y} \|P_{X|y} - P_{X|Z} \circ P_{Z|X} \circ P_{X|y}\|$$

and then using $P_{Z|X}$ as the feature map. This can be solved approximately through an alternating 
scheme of $L_1$ minimization problems (see additional material). Examples of how this method 
behaves on some toy problems are given in the examples section. 
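To make the objective concrete, the sketch below evaluates it for one fixed pair of feature map and re-constructor on finite sets; it does not perform the minimization. It assumes row-stochastic nested lists and takes the variational divergence to be the $L_1$ distance between probability vectors; the names and numbers are hypothetical.

```python
def compose(second, first):
    """Kernel composition `second o first` for row-stochastic matrices."""
    n_c = len(second[0])
    return [
        [sum(row[b] * second[b][c] for b in range(len(row))) for c in range(n_c)]
        for row in first
    ]

def reconstruction_objective(p_y, p_x_given_y, p_z_given_x, p_x_given_z):
    """E_{y ~ P_Y} || P_{X|y} - P_{X|Z} o P_{Z|X} o P_{X|y} ||_1 for fixed kernels."""
    reconstructed = compose(p_x_given_z, compose(p_z_given_x, p_x_given_y))
    return sum(
        p_y[y] * sum(abs(a - b) for a, b in zip(p_x_given_y[y], reconstructed[y]))
        for y in range(len(p_y))
    )

p_y = [0.5, 0.5]
p_x_given_y = [[0.7, 0.2, 0.1],
               [0.1, 0.3, 0.6]]

# A lossy feature map that merges the first two instances, with a re-constructor
# that spreads feature 0 evenly over them.
p_z_given_x = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
p_x_given_z = [[0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]
lossy_value = reconstruction_objective(p_y, p_x_given_y, p_z_given_x, p_x_given_z)

# The identity feature map reconstructs every class conditional exactly.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
identity_value = reconstruction_objective(p_y, p_x_given_y, identity, identity)
```

Any fixed pair of kernels gives an upper bound on the infimum, so `lossy_value` (0.35 here) bounds the best achievable value for this two-valued feature space from above.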

4 Unsupervised Feature Learning 

One major drawback of the previous supervised feature learning methods is that they require some 
knowledge of $P_{Y|X}$ or $P_{X|Y}$. The first three methods also require some knowledge of the loss 
function of interest. These methods consider a single supervised task in isolation. They extract the 
information in $X$ that is relevant to predicting $Y$. In many problems of interest we have access to a 
large data set of unlabelled samples drawn from $P_X$; however, we may have limited knowledge of 
the tasks that $X$ will be used for. We desire a feature map that provides a compact representation 
of $X$ that loses no information about $X$. While at first this might seem vacuous (for example, one 
could always just use the identity function), in many cases we can do much better. The data sets we 
tend to deal with have certain structure that we have not cared to directly specify in our models. This 
automated search for structure is what is behind the current deep learning fashion. 

Here we make the assumption that we have enough data to form an accurate estimate of $P_X$, the 
marginal distribution over instances, and ask the following question: under what conditions can 
we guarantee that a feature map $P_{Z|X}$ does not lose more than $\epsilon$ information about $Y$, no matter 
what the relation between $X$ and $Y$ or the loss function? The only restriction we place on possible 
relationships $P_{XY}$ between $X$ and $Y$ is that the marginal distribution over instances is consistent 
with the one we have learnt. 

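Theorem 5. For all feature maps $P_{Z|X}$, $\Delta R_L(P_{XY}, P_{Z|X}) \le \epsilon \|L\|$ for all $P_{XY}$, label spaces $Y$ 
and loss functions $L$ if and only if there exists a $P_{X|Z}$ such that $\mathbb{E}_{x \sim P_X} \mathbb{E}_{x' \sim P_{X|Z} \circ P_{Z|x}} \mathbb{1}[x' \ne x] \le \epsilon$. 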

In order to minimize the information lost from $X$, one needs to be able to reconstruct $X$ from $Z$ 
with high probability. We show in the next section that under some of the heuristic justifications of 
deep learning techniques like the autoencoder [26] and the deep belief network, one is solving 
a surrogate to this problem. 
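The reconstruction condition is easy to evaluate on finite spaces. The following sketch computes the probability that a reconstruction $x'$ differs from $x$ for one given feature map and re-constructor; the row-stochastic list representation, the function name and the numbers are assumptions made for illustration.

```python
def reconstruction_error(p_x, p_z_given_x, p_x_given_z):
    """P(x' != x) when x ~ P_X, z ~ P_{Z|x} and x' ~ P_{X|z}."""
    n_x, n_z = len(p_x), len(p_x_given_z)
    prob_correct = sum(
        p_x[x] * p_z_given_x[x][z] * p_x_given_z[z][x]
        for x in range(n_x) for z in range(n_z)
    )
    return 1.0 - prob_correct

p_x = [0.5, 0.3, 0.2]
# Feature map merging the first two instances into feature 0.
p_z_given_x = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
# Re-constructor mapping feature 0 to instance 0 and feature 1 to instance 2.
p_x_given_z = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

epsilon = reconstruction_error(p_x, p_z_given_x, p_x_given_z)
```

Only the second instance (probability 0.3) is ever reconstructed incorrectly, so `epsilon` is 0.3, and Theorem 5 then bounds the feature gap of this map by $0.3\|L\|$ for every task sharing this marginal.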

Theorem 5 also highlights the connection between feature learning and reconstruction. Reconstructing 
well is equivalent to finding generically good features. Theorem 5 makes no use of any 
interesting structure of the instance space $X$, effectively using the discrete metric on $X$, $d(x, x') = 1$ 
if $x \ne x'$. If one makes a smoothness assumption on the experiments of interest, a different version 
of Theorem 5 is obtained. 

Definition 6. For all joint distributions $P_{XY}$ and losses $L$ the reconstruction regret is given by 

$$D_r(x, x') := D_L(P_{Y|x}, P_{Y|x'}) = \mathbb{E}_{y \sim P_{Y|x}} L(y, f_{P_{XY}}(x')) - \mathbb{E}_{y \sim P_{Y|x}} L(y, f_{P_{XY}}(x)).$$

The reconstruction regret is the regret suffered in choosing actions based on a nearby $x'$ when in fact 
one should have used $x$. If we assume that $X$ is equipped with a metric $d : X \times X \to \mathbb{R}$, then we 
might wish to reconstruct well with respect to this metric. 

Theorem 7. For all feature maps $P_{Z|X}$ the following are equivalent: 

1. There exists $P_{X|Z}$ such that $\mathbb{E}_{x \sim P_X} \mathbb{E}_{x' \sim P_{X|Z} \circ P_{Z|x}} d(x, x') \le \epsilon$. 

2. For all distributions $P_{XY}$ and loss functions $L$ with $D_r(x, x') \le \lambda d(x, x')$ for all $x, x'$, 
$\Delta R_L(P_{XY}, P_{Z|X}) \le \epsilon \lambda$. 

For proof see additional material. Theorem 5 follows by taking $d$ to be the discrete metric on $X$. 


4.1 Surrogate Approaches Motivated by Theorem 5 


Theorem 5 requires one to be able to reconstruct $X$ from the features $Z$ with high probability if one 
wishes to obtain generically good features. There are many surrogates to this problem. Many existing feature 
learning methods are motivated through an appeal to the Infomax principle. Features should 
be chosen to maximize the mutual information $I(X; Z)$ or, equivalently, to minimize the conditional 
entropy $H(X|Z)$. 

Theorem 8 (Hellman-Raviv [12]). Let $X$ and $Z$ be finite spaces. For all feature maps $P_{Z|X}$ and 
priors $P_X$, 

$$\inf_{P_{X|Z}} \mathbb{E}_{x \sim P_X} \mathbb{E}_{x' \sim P_{X|Z} \circ P_{Z|x}} \mathbb{1}[x' \ne x] \le \frac{1}{2} H(X|Z).$$
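The bound can be checked numerically on a small example. The sketch below assumes the entropy is measured in bits (the usual convention for the Hellman-Raviv bound) and uses the fact that the maximum a posteriori re-constructor attains the infimum on finite spaces; the function names and numbers are illustrative.

```python
import math

def min_reconstruction_error(p_x, p_z_given_x):
    """Error of the MAP re-constructor, which attains the infimum over P_{X|Z}."""
    n_x, n_z = len(p_x), len(p_z_given_x[0])
    prob_correct = sum(
        max(p_x[x] * p_z_given_x[x][z] for x in range(n_x)) for z in range(n_z)
    )
    return 1.0 - prob_correct

def conditional_entropy_bits(p_x, p_z_given_x):
    """H(X|Z) in bits for a finite prior and feature map."""
    n_x, n_z = len(p_x), len(p_z_given_x[0])
    entropy = 0.0
    for z in range(n_z):
        p_z = sum(p_x[x] * p_z_given_x[x][z] for x in range(n_x))
        for x in range(n_x):
            p_xz = p_x[x] * p_z_given_x[x][z]
            if p_xz > 0.0:
                entropy -= p_xz * math.log2(p_xz / p_z)
    return entropy

p_x = [0.5, 0.3, 0.2]
p_z_given_x = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # merges the first two instances

error = min_reconstruction_error(p_x, p_z_given_x)
half_entropy = 0.5 * conditional_entropy_bits(p_x, p_z_given_x)
```

For this feature map the smallest reconstruction error is 0.3, while half the conditional entropy is roughly 0.38, consistent with the theorem.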


Hence the conditional entropy bounds the smallest probability of error possible when one attempts to 
reconstruct $X$ from the output of the feature map $P_{Z|X}$. One can view the Infomax principle as being a surrogate 
to reconstruction error. By exploiting various representations of $H(X|Z)$, many other surrogates to 
reconstructing with high probability can be obtained [2][26]. For example, by properties of the KL 
divergence 

$$H(X|Z) = \mathbb{E}_{(x,z) \sim P_{XZ}} -\log(P_{X|z}(x)) = \inf_{P_{X|Z}} \mathbb{E}_{(x,z) \sim P_{XZ}} -\log(P_{X|z}(x)),$$

where the infimum is over all Markov kernels from $Z$ to $X$ and is attained at the true conditional distribution. 
If we restrict the possible $P_{X|Z}$ to distributions of the form $P_{X|z} = \mathcal{N}(f(z), \sigma^2)$ (normal distributions 
with mean $f(z)$) for some function $f : Z \to X$ and standard deviation $\sigma$, we obtain 

$$H(X|Z) \le \inf_f \mathbb{E}_{(x,z) \sim P_{XZ}} \frac{1}{2\sigma^2}(x - f(z))^2 + \log(\sqrt{2\pi}\sigma).$$
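The squared error term comes directly from the negative log density of the normal distribution. As a worked step, assuming scalar $x$ for simplicity:

```latex
\begin{aligned}
-\log \mathcal{N}(x; f(z), \sigma^2)
  &= -\log\left( \frac{1}{\sqrt{2\pi}\,\sigma}
       \exp\left( -\frac{(x - f(z))^2}{2\sigma^2} \right) \right) \\
  &= \frac{1}{2\sigma^2}\,(x - f(z))^2 + \log\left(\sqrt{2\pi}\,\sigma\right).
\end{aligned}
```

For fixed $\sigma$ the second term is constant, so minimizing the bound over $f$ is the same as minimizing the expected squared reconstruction error.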

If we restrict the possible feature maps to $P_{Z|x} = \delta_{g(x)}$ then we obtain the autoencoder. Hence the 
autoencoder can be seen as a surrogate to Theorem 5. Its use can also be justified by Theorem 7. Many other 
feature learning methods such as K-means and principal component analysis can be seen as specific 
instances of the autoencoder: $g$ is a linear projection for PCA, and $Z$ is finite for K-means. 


4.2 Rate Distortion Theory 

Rate-distortion theory provides lower bounds on the distortion, or in our terminology $R_L(P_{ZY})$, 
in terms of the rate $I(X; Z)$, of the form $\phi_L^{-1}(I(X; Z)) \le R_L(P_{ZY})$ with $\phi_L$ the rate-distortion 
function 

$$\phi_L(d) = \inf_{P_{A|Y},\ \mathbb{E}_{P_{YA}} L \le d} I(Y; A).$$

Determining this function involves solving a series of convex problems, for which a fast iterative 
algorithm exists [7]. The end-to-end performance of the complete system is captured in the 
rate-distortion function, the quality of the feature map by $I(X; Z)$. This bound provides a ranking of 
feature maps that depends only on the loss of interest and the mutual information of the feature map 
and, more importantly, not on the experiment. Combined with Theorem 5 one obtains bounds of the 
form 

$$\phi_L^{-1}(I(X; Z)) \le R_L(P_{ZY}) \le R_L(P_{XY}) + \frac{1}{2} H(X|Z) \|L\|, \quad \forall L.$$
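One standard iterative scheme for tracing out a rate-distortion curve on finite spaces is the Blahut-Arimoto iteration. The sketch below is a minimal version under stated assumptions: `p_y` is strictly positive, the loss is a table `loss[y][a]`, the rate is reported in nats, and each value of the slope parameter `beta` yields one point on the curve. Whether it coincides in detail with the algorithm cited in the text is an assumption.

```python
import math

def blahut_arimoto(p_y, loss, beta, n_iter=200):
    """Return one (rate in nats, distortion) point for slope parameter beta > 0."""
    n_y, n_a = len(p_y), len(loss[0])
    q_a = [1.0 / n_a] * n_a                      # current marginal over actions
    kernel = []
    for _ in range(n_iter):
        # Best kernel P_{A|y} for the current action marginal.
        kernel = []
        for y in range(n_y):
            weights = [q_a[a] * math.exp(-beta * loss[y][a]) for a in range(n_a)]
            total = sum(weights)
            kernel.append([w / total for w in weights])
        # Best action marginal for the current kernel.
        q_a = [sum(p_y[y] * kernel[y][a] for y in range(n_y)) for a in range(n_a)]
    distortion = sum(
        p_y[y] * kernel[y][a] * loss[y][a] for y in range(n_y) for a in range(n_a)
    )
    rate = sum(
        p_y[y] * kernel[y][a] * math.log(kernel[y][a] / q_a[a])
        for y in range(n_y) for a in range(n_a) if kernel[y][a] > 0.0
    )
    return rate, distortion

# Uniform binary labels with 0-1 loss.
rate, distortion = blahut_arimoto([0.5, 0.5], [[0.0, 1.0], [1.0, 0.0]], beta=2.0)
```

For a uniform binary source with 0-1 loss the curve is known in closed form, $\log 2 - h(d)$ with $h$ the binary entropy in nats, which gives a simple sanity check for the returned point.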

Are there better surrogates? Ideally we wish to calculate $R_L(P_{ZY})$; however, this requires knowledge 
of $P_{X|Y}$ and not just the feature map $P_{Z|X}$ and marginal $P_X$. We can calculate $I(X; Z)$ and 
rely on rate-distortion and deficiency theory to provide bounds. This raises the question: 
if we know the loss function, can we do better than mutual information for providing 
performance bounds? At least in the case of the lower bound the answer is yes. In [28], a large class 
of generalized information measures is considered. For each of these information measures a 
rate-distortion theorem is obtained, and in many cases using one of these instead of mutual information 
provides tighter lower bounds. 


Definition 9. For convex $f : \mathbb{R}_+ \to \mathbb{R}$ with $f(1) = 0$, the $f$-information of a joint distribution $P_{XY}$ 
is given by 

$$I_f(P_{XY}) = \mathbb{E}_{P_{XY}} f\left( \frac{d(P_X \otimes P_Y)}{dP_{XY}} \right).$$
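For finite spaces the $f$-information is a finite sum. The sketch below evaluates it for a joint distribution stored as a nested list, assuming every cell of the joint is strictly positive (cells of zero mass need a separate limiting convention that is not handled here). The choice $f(t) = -\log t$ recovers mutual information in nats, and $f(t) = (\sqrt{t} - 1)^2$ gives the Hellinger information used in the illustrations; the function names and the example joint are illustrative.

```python
import math

def f_information(joint, f):
    """I_f(P_XY) = E_{P_XY} f( d(P_X x P_Y) / dP_XY ) for a strictly positive joint[x][y]."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    return sum(
        p * f(p_x[i] * p_y[j] / p)
        for i, row in enumerate(joint) for j, p in enumerate(row)
    )

def neg_log(t):
    """f(t) = -log t, which recovers mutual information (in nats)."""
    return -math.log(t)

def hellinger(t):
    """f(t) = (sqrt(t) - 1)^2, which gives Hellinger information."""
    return (math.sqrt(t) - 1.0) ** 2

# Joint of a uniform two-point prior with a 2 x 3 row-stochastic channel.
channel = [[0.8, 0.1, 0.1], [0.1, 0.4, 0.5]]
joint = [[0.5 * p for p in row] for row in channel]

mutual_information = f_information(joint, neg_log)
hellinger_information = f_information(joint, hellinger)
```

Both quantities vanish for a product distribution, since the density ratio is then identically 1 and $f(1) = 0$.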


We present in the illustrations section an example of when using one of these measures of information 
provides a tighter bound than mutual information. This observation may have algorithmic 
implications. Ultimately the feature map $P_{Z|X}$ will be restricted to lie in some function class. If $L$ 
is known it may be better to optimize one of these general forms of information rather than mutual 
information. 


4.3 Hierarchical Learning of Features 


One of the main tenets of the deep learning paradigm is that features should be learnt in a hierarchical 
fashion. Rather than learning a single feature map, one learns a chain 

$$X = Z_0 \rightleftarrows Z_1 \rightleftarrows Z_2 \rightleftarrows \cdots \rightleftarrows Z_n$$

with feature maps $P_{Z_{i+1}|Z_i}$ and reconstructions $P_{Z_i|Z_{i+1}}$ linking adjacent layers. The final feature map 
$P_{Z_n|X} = P_{Z_n|Z_{n-1}} \circ \cdots \circ P_{Z_1|X}$ is the composition of all the feature maps 
in the chain, and the final reconstruction is given by $P_{X|Z_n} = P_{X|Z_1} \circ \cdots \circ P_{Z_{n-1}|Z_n}$. Such a scheme 
has obvious computational advantages: one can learn each layer in a greedy fashion. To analyse the 
entire system, one can invoke a union bound, obtaining 

$$\mathbb{E}_{x \sim P_X} \mathbb{E}_{x' \sim P_{X|Z_n} \circ P_{Z_n|x}} \mathbb{1}[x \ne x'] \le \sum_{i=0}^{n-1} \mathbb{E}_{z_i \sim P_{Z_i}} \mathbb{E}_{z_i' \sim P_{Z_i|Z_{i+1}} \circ P_{Z_{i+1}|z_i}} \mathbb{1}[z_i \ne z_i'],$$

i.e., the probability of reconstruction error for the entire system is bounded by the sum of the 
probabilities of reconstruction error for each layer. See additional material for a proof. Hence the deep 
belief network and other hierarchical methods can be seen as solving a surrogate to Theorem 5. 


4.4 Semi Supervised Learning and Transfer of Generalization Bounds 

In semi-supervised learning one wishes to learn a classifier $f \in A^X$ from a data set comprising $n$ 
draws from $P_{XY}$ and $m$ draws from $P_X$, where normally $m \gg n$. To tackle this problem one can 
learn a representation of $X$ via a feature map $P_{Z|X}$ from the unlabelled data. One can then learn a 
classifier $g \in A^Z$ from the labelled data $(z_i, y_i) \sim P_{ZY}$, $z_i \sim P_{Z|x_i}$. Theorem 5 allows one to analyse 
the generalization performance of such a joint system. If something is known about the sample 
complexity of learning $g$ and of learning $P_{Z|X}$ then Theorem 5 allows one to combine these to give 
a sample complexity for learning both. Much is known about the sample complexity of supervised 
learning. For the sample complexity of (some) reconstruction schemes we point the reader to 
recent work on the topic. These works give sample complexity bounds for many different reconstruction 
schemes under square loss, in particular k-means, principal component analysis and sparse coding. 
Our results allow one to transfer these results to the semi-supervised learning domain. 


5 Illustrations 

In this section we give some simple examples of how the different feature learning schemes discussed 
operate in practice. We also give examples of when one can learn sufficient features for a particular 
experiment as well as when it is possible to learn generic features. 

Experiment Specific Features. Let $Y = \mathbb{R}$ with $X = \mathbb{R}^n$ and $P_{X|y}$ given by the product of 
$n$ normal distributions with mean $y$ and variance 1. It is easy to verify that the sample mean, a map 
$X \to \mathbb{R}$, is a sufficient statistic, meaning that at least for this experiment we can greatly compress 
the information contained in $X$. However, if we take as a prior for $Y$ a normal distribution of mean 
0 and variance 1, then the marginal distribution $P_X$ will not be concentrated on a set of smaller 
dimension, nor have any particularly interesting structure. Hence we cannot find interesting generic 
features in this case. 
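The sufficiency of the sample mean can be seen from a factorization of the likelihood. As a worked step, writing $\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$ (notation introduced here for illustration):

```latex
\begin{aligned}
\prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{(x_i - y)^2}{2} \right)
  &= (2\pi)^{-n/2} \exp\left( -\frac{1}{2} \sum_{i=1}^{n} x_i^2 \right)
     \cdot \exp\left( n y \bar{x} - \frac{n y^2}{2} \right).
\end{aligned}
```

The first factor does not depend on $y$ and the second depends on $x$ only through $\bar{x}$, so the posterior over $y$ given $x$ is a function of $\bar{x}$ alone.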

Experiment and Loss Specific Features. Let $Y = \{-1, 1\}$ with $P_{X|y} = \mathcal{N}(y, 1)$. For this 
experiment, 0-1 loss ($L_{01}$) and a uniform prior, the Bayes optimal $f$ is given by $f(x) = 1$ if 
$x > 0$, as $P(1|x) > \frac{1}{2}$, and $f(x) = -1$ otherwise, as $P(1|x) < \frac{1}{2}$. It is easy to show 
that $\Delta R_{L_{01}}(P_{XY}, f) = 0$: all we need is the output of $f$. However, if we change the loss to a 
cost-sensitive loss $L_c$ [22] where misclassifying a 1 is more costly than a $-1$, we no longer have 
$\Delta R_{L_c}(P_{XY}, f) = 0$, as this would change the optimal threshold for classifying a 1 versus a $-1$. 
However, if there was a jump discontinuity in $P(1|x)$, i.e. it jumped from say 0.4 to 0.6 as $x$ 
crossed over $x = 0$, then the feature gap would be zero for a broader range of cost-sensitive losses. 
Once again there are no generic features of interest. 
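The shift in threshold can be made explicit. In the worked step below, $c_1$ and $c_{-1}$ are hypothetical names for the costs of misclassifying a $1$ and a $-1$ respectively; they are not notation from the original text.

```latex
\begin{aligned}
P(1 \mid x) &= \frac{e^{-(x-1)^2/2}}{e^{-(x-1)^2/2} + e^{-(x+1)^2/2}} = \frac{1}{1 + e^{-2x}}, \\
\text{predict } 1 \iff c_1 P(1 \mid x) > c_{-1} P(-1 \mid x)
  &\iff e^{2x} > \frac{c_{-1}}{c_1}
  \iff x > \frac{1}{2} \log \frac{c_{-1}}{c_1}.
\end{aligned}
```

With $c_1 = c_{-1}$ the threshold is $0$ and the sign of $x$ suffices. With $c_1 > c_{-1}$ the threshold is negative, so the feature $f$ cannot distinguish the instances between the new threshold and $0$ from other negative instances, and the feature gap becomes positive.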

Loss Sensitive versus Loss Insensitive Features. Let $Y = \{1, 2, 3\}$ with a uniform prior for $Y$ and 
$P_{X|Y}$ given by the normal distributions in the figure below. Consider the feature space $Z = \{1, 2\}$. 
Below are plots of the features learnt by two different feature learning schemes. The first is the 
loss insensitive weighted directed deficiency minimization method. The second is the information 
bottleneck, where we know beforehand that misclassifying a 2 is more costly than misclassifying 
one of the others. A loss of this form is achieved by tilting the standard Brier loss [22] toward class 
2. The green regions are those $x$ that are mapped to the feature 1, the blue are those mapped to 2. 


[Figure 1 panels: Generic Features; Bottleneck Features]



Figure 1: Loss Sensitive versus Loss Insensitive Features, see text 

We can see even in this simple example that the loss function matters when determining sensible 
features. While the weighted directed deficiency method divides $X$ into regions that allow good 
reconstruction of all the class conditionals, the bottleneck features focus on separating class 2, as 
dictated by the loss function. The weighted directed deficiencies $\delta_{P_Y}(P_{Z|Y}, P_{X|Y})$ are 0.629 
and 0.698 respectively, indicating that from a worst case perspective the two feature maps are very 
similar. However, for the particular loss we have used the feature gap is very different, 1.075 versus 
0.325. 

Learning Generic Features. All previous examples have considered a fixed experiment. When 
learning features in an unsupervised fashion, one wishes to find features that work for all experiments 
that use X. There are many examples of when this is possible, and they all boil down to some sort 
of manifold assumption. If Px is concentrated on some lower dimensional subset of X, then one 
can find generic features. 

Rate Distortion Lower Bounds. As an example of the different bounds one can obtain using 
$f$-informations, we consider a simple example where $Y = \{0, 1\}$ and the loss is a cost-sensitive 
misclassification loss with $L(0, 1) = 1$ and $L(1, 0) = 4$. We consider the feature map 

$$P_{Z|X} = \begin{pmatrix} 0.8 & 0.1 & 0.1 \\ 0.1 & 0.4 & 0.5 \end{pmatrix}$$

given as a row stochastic matrix, with uniform prior $P_X$. We consider $f(x) = (\sqrt{x} - 1)^2$, resulting 
in Hellinger information. Below are plots of the rate-distortion curves for both mutual information 
(red) and Hellinger information (blue), as well as the informations of the channel (the dashed 
horizontal lines). The black vertical line represents the lower bound on the distortion. For this channel 
Hellinger information gives a tighter lower bound. For further illustrations see additional material. 


[Figure 2 panel: Rate-Distortion Curves]



Figure 2: Generalized Rate-Distortion Plots, see text 


6 Conclusion 

Automated feature learning methods have produced remarkable empirical results; however, little 
theory exists explaining their performance. This paper provides direction as to how to progress the 
theory. To this end, we have placed several current supervised feature learning methods in a general 
framework, provided a novel loss insensitive method for learning features, and provided novel 
means of transferring regret bounds from unsupervised feature learning methods to supervised 
learning methods. Finally, we have shown the usefulness of rate-distortion theory and its under-utilized 
generalizations in ascertaining the quality of learnt features. 

7 Additional Material 


7.1 Background on Proper Losses 

Here we review some material that greatly eases working with proper loss functions and highlights the 
connection between loss, Bayes risk, regret and Bregman divergences. 

Definition 10. A loss function $L : Y \times \mathcal{P}(Y) \to \mathbb{R}$ is proper if for all $P \in \mathcal{P}(Y)$ 

$$P \in \arg\inf_{Q \in \mathcal{P}(Y)} \mathbb{E}_{y \sim P} L(y, Q).$$


Any loss function can be properized. 

Theorem 11. Let $L : Y \times A \to \mathbb{R}$ be a loss. For $P \in \mathcal{P}(Y)$ define 

$$a_P = \arg\inf_a \mathbb{E}_{y \sim P} L(y, a)$$

where we arbitrarily pick an $a \in \arg\inf_a \mathbb{E}_{y \sim P} L(y, a)$ if there are multiple. Then $L(y, P) = L(y, a_P)$ is 
proper. 

It is possible that by using this trick we remove useful actions $a \in A$. However, for the purpose of calculating 
expected risks we do not require these actions. From $L$, one can define a regret 

$$D(P, Q) = \mathbb{E}_{y \sim P} L(y, a_Q) - \mathbb{E}_{y \sim P} L(y, a_P)$$

which measures how suboptimal the best action for the distribution $Q$ is when played against the distribution 
$P$. One does not need knowledge of the original loss to construct the properized loss; only the Bayes risk 

$$\underline{L}(P) = \inf_a \mathbb{E}_{y \sim P} L(y, a)$$

is needed. From this one can reconstruct the properized loss, and hence $L$ for the purposes of calculating minimum expected 
risks. This is achieved by taking the 1-homogeneous extension of $\underline{L}$ 

$$\underline{L} : \mathbb{R}^{|Y|}_+ \to \mathbb{R}, \qquad v \mapsto \|v\|_1 \, \underline{L}\left( \frac{v}{\|v\|_1} \right)$$

and differentiating/taking super gradients. The following three theorems highlight the usefulness of the 
1-homogeneous extension. 

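Definition 12 (Super Gradient Function). Let $f : C \subseteq \mathbb{R}^n \to \mathbb{R}$ be a concave function. Then 

$$\nabla f : C \to \mathbb{R}^n$$

is a super gradient function if for all $x \in C$, $\nabla f(x) \in \partial f(x)$. 

Theorem 13. For any concave 1-homogeneous function $f : C \subseteq \mathbb{R}^n \to \mathbb{R}$ and any super-gradient function 
$\nabla f$, 

$$f(x) = \langle x, \nabla f(x) \rangle.$$

Theorem 14. For any concave $\underline{L} : \mathcal{P}(Y) \to \mathbb{R}$, 

$$L(y, P) = \langle \delta_y, \nabla \underline{L}(P) \rangle$$

is a proper loss. 

Theorem 15. The regret derived from a proper loss $L$ is equal to the Bregman divergence defined by $\underline{L}(P)$: 

$$D(P, Q) = D_L(P, Q) = D_{\underline{L}}(P, Q) = \underline{L}(Q) + \langle P - Q, \nabla \underline{L}(Q) \rangle - \underline{L}(P) = \langle P, \nabla \underline{L}(Q) - \nabla \underline{L}(P) \rangle.$$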

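Finally, given a concave 1-homogeneous Bayes risk $\underline{L}$ and a vector $\pi \in \mathbb{R}^{|Y|}_+$, one can tilt $\underline{L}$ by $\pi$, yielding 

$$\underline{L}_\pi(v) = \underline{L}(\pi v)$$
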
with $\pi v$ the element-wise product of $\pi$ and $v$. It is easily verified that this new function is both concave 
and 1-homogeneous. The tilting has the effect of making certain elements of $Y$ more important in the 
resulting loss. For example, if we start with a symmetric Bayes risk like Shannon entropy and tilt by the vector 
$(1, 10, 1)$, then the resulting loss places more importance on predicting $y_2$ correctly. This is analogous to how 
cost-sensitive misclassification losses are produced from 0-1 loss. 

7.2 Background on the Information Bottleneck 

For a given joint distribution $P_{XY}$ and loss function $L$, the information bottleneck / clustering with Bregman 
divergences attempts to extract a feature map by solving 

$$\inf_{P_{Z|X}} \beta \Delta R_L(P_{XY}, P_{Z|X}) + I(X; Z)$$

i.e. a regularized feature gap, with the mutual information $I(X; Z)$ serving as the regularizer. This can be 
solved by an alternating algorithm. Here we review the derivation of this algorithm. 

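Theorem 16. Writing $\Theta$ for the label variable, 

$$\inf_{P_{Z|X}} \beta \Delta R_L(P_{XY}, P_{Z|X}) + I(X; Z) = \inf_{P_{Z|X}} \inf_{P_{\Theta|Z}} \inf_{P_Z} \beta\, \mathbb{E}_{x \sim P_X} \mathbb{E}_{z \sim P_{Z|x}} D_L(P_{\Theta|x}, P_{\Theta|z}) + \mathbb{E}_{x \sim P_X} D_{KL}(P_{Z|x}, P_Z).$$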


For the proof we require the following lemma. 

Lemma 17. For all concave $L : X \to \mathbb{R}$ and distributions $P \in \mathcal{P}(X)$ 

$$\mathbb{E}_{y \sim P}[y] \in \arg\inf_x \mathbb{E}_{y \sim P} D_L(y, x).$$

The mean is the expected Bregman divergence minimizer. 

We can now prove the theorem 


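Proof. Firstly, 

$$I(X; Z) = \mathbb{E}_{x \sim P_X} D_{KL}(P_{Z|x}, P_Z) = \inf_{P_Z} \mathbb{E}_{x \sim P_X} D_{KL}(P_{Z|x}, P_Z)$$

as $\mathbb{E}_{x \sim P_X} P_{Z|x} = P_Z$. Secondly, 

$$\begin{aligned}
\mathbb{E}_{x \sim P_X} \mathbb{E}_{z \sim P_{Z|x}} D_L(P_{\Theta|x}, P_{\Theta|z}) &= \mathbb{E}_{z \sim P_Z} \mathbb{E}_{x \sim P_{X|z}} D_L(P_{\Theta|x}, P_{\Theta|z}) \\
&= \mathbb{E}_{z \sim P_Z} \inf_{P_{\Theta|z}} \mathbb{E}_{x \sim P_{X|z}} D_L(P_{\Theta|x}, P_{\Theta|z}) \\
&= \inf_{P_{\Theta|Z}} \mathbb{E}_{z \sim P_Z} \mathbb{E}_{x \sim P_{X|z}} D_L(P_{\Theta|x}, P_{\Theta|z})
\end{aligned}$$

as $\mathbb{E}_{x \sim P_{X|z}} P_{\Theta|x} = P_{\Theta|z}$. Combining gives 

$$\begin{aligned}
&\inf_{P_{Z|X}} \beta\, \mathbb{E}_{x \sim P_X} \mathbb{E}_{z \sim P_{Z|x}} D_L(P_{\Theta|x}, P_{\Theta|z}) + \mathbb{E}_{x \sim P_X} D_{KL}(P_{Z|x}, P_Z) \\
&\quad = \inf_{P_{Z|X}} \inf_{P_{\Theta|Z}} \inf_{P_Z} \beta\, \mathbb{E}_{x \sim P_X} \mathbb{E}_{z \sim P_{Z|x}} D_L(P_{\Theta|x}, P_{\Theta|z}) + \mathbb{E}_{x \sim P_X} D_{KL}(P_{Z|x}, P_Z).
\end{aligned}$$

This completes the proof. □ 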


The above theorem allows one to (at least approximately) find loss specific features. 

7.3 Background on Loss Insensitive Feature Learning 

Recall that loss insensitive feature learning seeks to find a feature map $P_{Z|X}$ and a re-constructor $P_{X|Z}$ that 
minimize 

$$\inf_{P_{Z|X}, P_{X|Z}} \mathbb{E}_{y \sim P_Y} \|P_{X|y} - P_{X|Z} \circ P_{Z|X} \circ P_{X|y}\|.$$

We show how this can be achieved by an alternating pair of linear programs. Assuming that $X$, $Y$, $Z$ are all 
finite sets, $P_{X|Y}$, $P_{Z|X}$ and $P_{X|Z}$ can be represented by column stochastic matrices $T$, $F$, $R$ respectively, with 
composition represented as matrix multiplication. Furthermore $P_Y$ can be represented by a probability vector 
$v$. The variational divergence between two distributions is the $L_1$ distance between their probability vectors 
[22]. For fixed $F$, taking an infimum over $R$ means solving the following linear program 


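$$\begin{aligned}
\inf_{Z_{ij},\, R_{ij}} \quad & \sum_{i=1}^{|X|} \sum_{j=1}^{|Y|} Z_{ij} \\
\text{subject to} \quad & R_{ij} \ge 0 \quad \forall i, j \\
& \sum_{i=1}^{|X|} R_{ij} = 1 \quad \forall j \\
& \Big| v_j T_{ij} - v_j \sum_{k=1}^{|Z|} \sum_{h=1}^{|X|} R_{ik} F_{kh} T_{hj} \Big| \le Z_{ij} \quad \forall i, j.
\end{aligned}$$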

The final constraint can be written as a pair of linear constraints. Fixing R and taking an infimum over F means 
solving the following 


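$$\begin{aligned}
\inf_{Z_{ij},\, F_{ij}} \quad & \sum_{i=1}^{|X|} \sum_{j=1}^{|Y|} Z_{ij} \\
\text{subject to} \quad & F_{ij} \ge 0 \quad \forall i, j \\
& \sum_{i=1}^{|Z|} F_{ij} = 1 \quad \forall j \\
& \Big| v_j T_{ij} - v_j \sum_{k=1}^{|Z|} \sum_{h=1}^{|X|} R_{ik} F_{kh} T_{hj} \Big| \le Z_{ij} \quad \forall i, j.
\end{aligned}$$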

Alternating these two minimizations provides means to find loss insensitive features. 
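As a small illustration, the sketch below evaluates the quantity that both linear programs minimize, $\sum_j v_j \| T_{\cdot j} - (RFT)_{\cdot j} \|_1$, for two hand-picked feature maps. It does not solve the linear programs, which would require an LP solver. The matrices are illustrative assumptions, stored as lists of rows with columns summing to one.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def reconstruction_objective(T, F, R, v):
    """sum_j v_j * || T[:, j] - (R F T)[:, j] ||_1 for column stochastic T, F, R."""
    RFT = matmul(R, matmul(F, T))
    return sum(v[j] * abs(T[i][j] - RFT[i][j])
               for i in range(len(T)) for j in range(len(v)))


# |X| = 3, |Y| = 2, |Z| = 2. Column j of T is P_{X|y=j}.
T = [[0.5, 0.0],
     [0.5, 0.0],
     [0.0, 1.0]]
v = [0.5, 0.5]

# Feature map that merges x_0 and x_1, which carry the same information about y.
F_good = [[1.0, 1.0, 0.0],
          [0.0, 0.0, 1.0]]
R_good = [[0.5, 0.0],
          [0.5, 0.0],
          [0.0, 1.0]]

# Feature map that merges x_1 and x_2, which come from different labels.
F_bad = [[1.0, 0.0, 0.0],
         [0.0, 1.0, 1.0]]
R_bad = [[1.0, 0.0],
         [0.0, 0.5],
         [0.0, 0.5]]

good = reconstruction_objective(T, F_good, R_good, v)
bad = reconstruction_objective(T, F_bad, R_bad, v)
print("merge x_0, x_1:", good)
print("merge x_1, x_2:", bad)
```

The first feature map is lossy on $X$ yet attains an objective of zero, while the second incurs a strictly positive objective.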


7.4 Proofs for Some Theorems in Main Text 
7.4.1 Proof of Theorem 1 

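Theorem. For all joint distributions $P_{XY}$ and feature maps $P_{Z|X}$

$$\Delta R_L(P_{XY}, P_{Z|X}) = \mathbb{E}_{(x,z) \sim P_{XZ}} D_L(P_{Y|x}, P_{Y|z}).$$

Proof.

$$\begin{aligned}
R_L(P_{ZY}) - R_L(P_{XY}) &= \mathbb{E}_{P_{ZY}} L(y, P_{Y|z}) - \mathbb{E}_{P_{XY}} L(y, P_{Y|x}) \\
&= \mathbb{E}_{(x,y) \sim P_{XY}} \mathbb{E}_{z \sim P_{Z|x}} \left[ L(y, P_{Y|z}) - L(y, P_{Y|x}) \right] \\
&= \mathbb{E}_{(x,z) \sim P_{XZ}} \mathbb{E}_{y \sim P_{Y|x}} \left[ L(y, P_{Y|z}) - L(y, P_{Y|x}) \right] \\
&= \mathbb{E}_{(x,z) \sim P_{XZ}} D_L(P_{Y|x}, P_{Y|z})
\end{aligned}$$

where the second last line follows from the fact that $Y \perp Z \mid X$, as $Y \to X \to Z$ forms a Markov chain. □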
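The identity can be verified numerically in a special case. The sketch below assumes log loss, for which $R_L(P_{XY}) = H(Y|X)$, $R_L(P_{ZY}) = H(Y|Z)$ and $D_L$ is the Kullback-Leibler divergence. The distributions are illustrative.

```python
import math


def entropy(p):
    """Shannon entropy in nats."""
    return -sum(a * math.log(a) for a in p if a > 0)


def kl(p, q):
    """KL(p || q) in nats."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


# Markov chain Y -> X -> Z given through P_X, P_{Y|X} and the feature map P_{Z|X}.
p_x = [0.2, 0.3, 0.5]
p_y_x = [[0.9, 0.1], [0.6, 0.4], [0.1, 0.9]]   # row x is P_{Y|x}
p_z_x = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]   # row x is P_{Z|x}

n_x, n_y, n_z = 3, 2, 2
p_z = [sum(p_x[x] * p_z_x[x][z] for x in range(n_x)) for z in range(n_z)]
p_y_z = [
    [sum(p_x[x] * p_z_x[x][z] * p_y_x[x][y] for x in range(n_x)) / p_z[z]
     for y in range(n_y)]
    for z in range(n_z)
]

# Left hand side: difference of Bayes risks under log loss.
risk_x = sum(p_x[x] * entropy(p_y_x[x]) for x in range(n_x))
risk_z = sum(p_z[z] * entropy(p_y_z[z]) for z in range(n_z))
lhs = risk_z - risk_x

# Right hand side: expected divergence between the two posteriors.
rhs = sum(p_x[x] * p_z_x[x][z] * kl(p_y_x[x], p_y_z[z])
          for x in range(n_x) for z in range(n_z))

print("risk difference:    ", lhs)
print("expected divergence:", rhs)
```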


7.4.2 Proof of Theorem 5 

Theorem. For all feature maps $P_{Z|X}$ the following are equivalent

1. $\exists P_{X|Z}$ such that $\mathbb{E}_{x \sim P_X} \mathbb{E}_{x' \sim P_{X|Z} \circ P_{Z|x}} d(x, x') \le \epsilon$

2. For all distributions $P_{XY}$ and loss functions $L$ with $D_r(x, x') \le \lambda d(x, x')$,
$\Delta R_L(P_{XY}, P_{Z|X}) \le \epsilon \lambda$
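Proof. [1 ⇒ 2] Let $f_{P_{XY}}$ be the Bayes optimal for $P_{XY}$ and $L$, and consider the randomized
function $P_{A|Z} = f_{P_{XY}} \circ P_{X|Z}$, i.e. the composition of the Bayes optimal and the re-constructor. Then

$$\begin{aligned}
\Delta R_L(P_{XY}, P_{Z|X}) &\le R_L(P_{ZY}, P_{A|Z}) - R_L(P_{XY}) \\
&= \mathbb{E}_{(z,y) \sim P_{ZY}} \mathbb{E}_{x' \sim P_{X|z}} L(y, f_{P_{XY}}(x')) - \mathbb{E}_{(x,y) \sim P_{XY}} L(y, f_{P_{XY}}(x)) \\
&= \mathbb{E}_{(x,y) \sim P_{XY}} \mathbb{E}_{x' \sim P_{X|Z} \circ P_{Z|x}} L(y, f_{P_{XY}}(x')) - \mathbb{E}_{(x,y) \sim P_{XY}} L(y, f_{P_{XY}}(x)) \\
&= \mathbb{E}_{(x,y) \sim P_{XY}} \mathbb{E}_{x' \sim P_{X|Z} \circ P_{Z|x}} \left[ L(y, f_{P_{XY}}(x')) - L(y, f_{P_{XY}}(x)) \right] \\
&= \mathbb{E}_{x \sim P_X} \mathbb{E}_{x' \sim P_{X|Z} \circ P_{Z|x}} \mathbb{E}_{y \sim P_{Y|x}} \left[ L(y, f_{P_{XY}}(x')) - L(y, f_{P_{XY}}(x)) \right] \\
&= \mathbb{E}_{x \sim P_X} \mathbb{E}_{x' \sim P_{X|Z} \circ P_{Z|x}} D_r(x, x') \\
&\le \lambda\, \mathbb{E}_{x \sim P_X} \mathbb{E}_{x' \sim P_{X|Z} \circ P_{Z|x}} d(x, x') \\
&\le \epsilon \lambda.
\end{aligned}$$

□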


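Proof. [2 ⇒ 1] Let $Y = X$, $A = X$ and $L(x', x) = d(x', x)$. Finally let $P_{XY} = P_X \otimes \mathrm{id}_X$, i.e. draw $x \sim P_X$
and return $(x, x)$. It is easy to confirm that $f_{P_{XY}}(x) = x$, $R_L(P_{XY}) = 0$ and $D_r(x', x) = d(x', x)$, so that 2 applies with $\lambda = 1$. By 2

$$\begin{aligned}
\epsilon &\ge \Delta R_L(P_{XY}, P_{Z|X}) \\
&= R_L(P_{ZY}) \\
&= \inf_{P_{A|Z} \in \mathcal{P}(\mathcal{X})^{\mathcal{Z}}} \mathbb{E}_{P_{ZY}} \mathbb{E}_{a \sim P_{A|z}} L(y, a) \\
&= \inf_{P_{A|Z} \in \mathcal{P}(\mathcal{X})^{\mathcal{Z}}} \mathbb{E}_{x \sim P_X} \mathbb{E}_{z \sim P_{Z|x}} \mathbb{E}_{x' \sim P_{A|z}} d(x, x'),
\end{aligned}$$

hence 1 is satisfied.

□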


7.4.3 Hierarchical Learning of Features Proof 

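Theorem. For all chains of feature maps and reconstruction functions

$$X = Z_0 \;\underset{P_{Z_0|Z_1}}{\overset{P_{Z_1|Z_0}}{\rightleftarrows}}\; Z_1 \;\underset{P_{Z_1|Z_2}}{\overset{P_{Z_2|Z_1}}{\rightleftarrows}}\; Z_2 \;\cdots\; \underset{P_{Z_{n-1}|Z_n}}{\overset{P_{Z_n|Z_{n-1}}}{\rightleftarrows}}\; Z_n$$

the probability of reconstruction error for the entire chain is bounded by the sum of the reconstruction
errors for each layer

$$\mathbb{E}_{x \sim P_X} \mathbb{E}_{x' \sim P_{X|Z_n} \circ P_{Z_n|x}} \mathbf{1}[x \ne x'] \le \sum_{i=0}^{n-1} \mathbb{E}_{z_i \sim P_{Z_i}} \mathbb{E}_{z'_i \sim P_{Z_i|Z_{i+1}} \circ P_{Z_{i+1}|z_i}} \mathbf{1}[z_i \ne z'_i],$$

where $P_{Z_n|X}$ and $P_{X|Z_n}$ denote the compositions of the feature maps and of the reconstruction functions along the chain.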


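Proof. Let $(z_0, z_1, \ldots, z_n)$ be the "true" elements at each level of the chain and $(z'_0, z'_1, \ldots, z'_{n-1})$ their
reconstructions. Consider the joint distribution $P$ with

$$\begin{aligned}
P(z_0, z_1, \ldots, z_n, z'_0, z'_1, \ldots, z'_{n-1}) &= P(z_0) P(z_1|z_0) P(z_2|z_1) \cdots P(z_n|z_{n-1}) P(z'_{n-1}|z_n) \cdots P(z'_0|z'_1) \\
&= P_X(z_0) P_{Z_1|z_0}(z_1) \cdots P_{Z_n|z_{n-1}}(z_n) P_{Z_{n-1}|z_n}(z'_{n-1}) \cdots P_{Z_0|z'_1}(z'_0).
\end{aligned}$$

Under this joint distribution

$$\begin{aligned}
P(z_0 \ne z'_0) &= P(z_0 \ne z'_0 \cap z_1 = z'_1) + P(z_0 \ne z'_0 \cap z_1 \ne z'_1) \\
&\le P(z_0 \ne z'_0 \cap z_1 = z'_1) + P(z_1 \ne z'_1).
\end{aligned}$$

To complete the proof, note that $P(z_0 \ne z'_0 \cap z_1 = z'_1) \le \mathbb{E}_{x \sim P_X} \mathbb{E}_{x' \sim P_{X|Z_1} \circ P_{Z_1|x}} \mathbf{1}[x \ne x']$, since on the event
$z_1 = z'_1$ the reconstruction $z'_0$ is drawn from $P_{Z_0|z_1}$, and proceed inductively.

□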
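The bound can be checked exactly on a small finite chain. The sketch below uses a two layer chain with hand-picked kernels, all of which are illustrative assumptions. Kernels are stored with rows indexed by the conditioning value, unlike the column stochastic convention of Section 7.3.

```python
def compose(first, second):
    """Kernel 'first, then second'; row a of a kernel is the distribution given a."""
    return [[sum(first[a][b] * second[b][c] for b in range(len(second)))
             for c in range(len(second[0]))]
            for a in range(len(first))]


def push(p, kernel):
    """Marginal obtained by passing the distribution p through a kernel."""
    return [sum(p[a] * kernel[a][b] for a in range(len(p)))
            for b in range(len(kernel[0]))]


def error(p, round_trip):
    """Probability that a draw from p differs from its round-trip reconstruction."""
    return sum(p[a] * (1.0 - round_trip[a][a]) for a in range(len(p)))


p_x = [0.3, 0.3, 0.4]
# forward[i] is P_{Z_{i+1}|Z_i}; backward[i] is P_{Z_i|Z_{i+1}}.
forward = [
    [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]],
    [[0.95, 0.05], [0.1, 0.9]],
]
backward = [
    [[0.5, 0.4, 0.1], [0.1, 0.1, 0.8]],
    [[0.9, 0.1], [0.05, 0.95]],
]

# Marginals P_{Z_0}, P_{Z_1}, ... of the true elements at each level.
marginals = [p_x]
for kernel in forward:
    marginals.append(push(marginals[-1], kernel))

# Reconstruction error of each layer taken in isolation.
layer_errors = [error(marginals[i], compose(forward[i], backward[i]))
                for i in range(len(forward))]

# Reconstruction error of the whole chain: encode to Z_n, then decode back to X.
encode = forward[0]
for kernel in forward[1:]:
    encode = compose(encode, kernel)
decode = backward[-1]
for kernel in reversed(backward[:-1]):
    decode = compose(decode, kernel)
chain_error = error(p_x, compose(encode, decode))

print("chain error:        ", chain_error)
print("sum of layer errors:", sum(layer_errors))
```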
7.5 Standard Rate-Distortion Theory

Given a channel $P_{Z|X}$, rate distortion theory provides a means of lower bounding the distortion of
the channel by a function of the channel's rate (maximum mutual information or capacity). For any prior $P_Y$,
experiment $P_{X|Y}$, feature map $P_{Z|X}$, estimator $P_{A|Z}$ and loss function $L$ one defines the distortion

$$d = \mathbb{E}_{y \sim P_Y} \mathbb{E}_{x \sim P_{X|y}} \mathbb{E}_{z \sim P_{Z|x}} \mathbb{E}_{a \sim P_{A|z}} L(y, a),$$

the rate

$$R = \sup_{P_X} I(X;Z),$$

and the rate-distortion function

$$\phi_L(d) = \inf_{P_{A|Y} \,:\, \mathbb{E}_{P_{YA}} L \le d} I(Y;A),$$

i.e., the smallest mutual information of all channels $P_{A|Y}$ with distortion at most $d$. The rate-distortion
function is non-increasing: the higher the distortion, the lower the required rate.

One obtains a lower bound on the distortion of the form $\phi_L^{-1}(R) \le d$, where $\phi_L$ is the rate-distortion function

$$\phi_L(d) = \inf_{P_{A|Y} \,:\, \mathbb{E}_{P_{YA}} L \le d} I(Y;A).$$


Key to the rate distortion bound is that mutual information satisfies a data processing inequality: for a Markov
chain

$$Y \to X \to Z \to A$$

$I(Y;A) \le I(X;Z)$ [7]. In particular this means that for a Markov kernel of the form $P_{A|Y} = P_{A|Z} \circ P_{Z|X} \circ P_{X|Y}$
to have distortion less than $d$,

$$\phi_L(d) \le I(X;Z) \le R.$$

This condition is necessary but not sufficient, leading to slack in the lower bound. Both the rate and the
rate-distortion function can be computed via an iterative algorithm. We direct the reader to [7] for derivations of the
bound as well as the algorithm for calculating it. The major strength of this bound is that it applies for all $P_Y$,
$P_{X|Y}$, $P_{Z|X}$ and $P_{A|Z}$. If the marginal $P_X$ is known, the bound can be further tightened to

$$\phi_L^{-1}(I(X;Z)) \le d.$$

Rate distortion theory provides another justification for the use of mutual information as a surrogate for feature
learning (different to theorem 4), and also provides a means to assess how good a surrogate it is via the rate
distortion function. Figure 3 plots the rate distortion curve for two different loss functions,
firstly Brier loss and secondly the tilted Brier loss from the example in figure 2. From the plot one can see that
more mutual information $I(X;Z)$ is required to have low distortion for the tilted Brier loss than for the standard
Brier loss. This is because the tilted Brier loss greatly penalizes mistakes made when classifying class 2, while
penalizing other errors in a similar way to the standard Brier loss.
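The iterative computation of the rate $R$ mentioned above can be sketched with the classical Blahut-Arimoto iteration for channel capacity. This is an assumption about which algorithm is meant, since the text only points to [7]. The channel matrices, the iteration count and the function names are illustrative.

```python
import math


def mutual_information(p_x, W):
    """I(X;Z) in bits for input distribution p_x and channel W[x][z] = P(z|x)."""
    n_x, n_z = len(W), len(W[0])
    p_z = [sum(p_x[x] * W[x][z] for x in range(n_x)) for z in range(n_z)]
    return sum(p_x[x] * W[x][z] * math.log2(W[x][z] / p_z[z])
               for x in range(n_x) for z in range(n_z)
               if p_x[x] > 0 and W[x][z] > 0)


def blahut_arimoto_capacity(W, iterations=2000):
    """Approximate sup over P_X of I(X;Z) by the Blahut-Arimoto iteration."""
    n_x, n_z = len(W), len(W[0])
    p_x = [1.0 / n_x] * n_x
    for _ in range(iterations):
        p_z = [sum(p_x[x] * W[x][z] for x in range(n_x)) for z in range(n_z)]
        # KL(W(.|x) || P_Z) measures how informative the input x currently is.
        d = [sum(W[x][z] * math.log(W[x][z] / p_z[z])
                 for z in range(n_z) if W[x][z] > 0)
             for x in range(n_x)]
        weights = [p_x[x] * math.exp(d[x]) for x in range(n_x)]
        total = sum(weights)
        p_x = [w / total for w in weights]
    return mutual_information(p_x, W), p_x


# Binary symmetric channel with crossover probability 0.1.
bsc = [[0.9, 0.1], [0.1, 0.9]]
# Z-channel: input 0 is noiseless, input 1 is flipped with probability 0.5.
z_channel = [[1.0, 0.0], [0.5, 0.5]]

rate_bsc, _ = blahut_arimoto_capacity(bsc)
rate_z, input_z = blahut_arimoto_capacity(z_channel)
print("rate of BSC(0.1):  ", rate_bsc)
print("rate of Z-channel: ", rate_z, "attained at P_X =", input_z)
```

For these two channels the result can be compared with the known closed forms, $1 - H_2(0.1)$ bits for the binary symmetric channel and $\log_2 1.25$ bits for the Z-channel.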


7.6 Tighter Bounds via Generalized Rate Distortion Theory 

Definition 18. For convex $f : \mathbb{R}_+ \to \mathbb{R}$ with $f(1) = 0$, the $f$-information of a joint distribution $P_{XY}$ is given
by

$$I_f(X;Y) = I_f(P_{XY}) = \mathbb{E}_{P_{XY}} f\left( \frac{d(P_X \otimes P_Y)}{dP_{XY}} \right).$$


When $f(x) = -\log(x)$ we recover the mutual information. Much like mutual information, $f$-information also
satisfies a data processing inequality. For any Markov chain

$$Y \to X \to Z \to A$$

$I_f(Y;A) \le I_f(X;Z)$. As such one can use $f$-information to construct an alternative rate distortion
function

$$\phi_{L,f}(d) = \inf_{P_{A|Y} \,:\, \mathbb{E}_{P_{YA}} L \le d} I_f(Y;A)$$

and an alternative lower bound. Unlike the case of mutual information, there is not a fast iterative algorithm to
calculate this function. However, it is easy to show that for fixed $d$ the above is a convex optimization problem
(as $f$-divergences are convex [22]).
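Definition 18 and the data processing inequality can be illustrated on finite sets. The sketch below computes $I_f$ for two choices of $f$, namely $f(t) = -\log t$, which gives the mutual information, and $f(t) = (t - 1)^2$, and compares $I_f(Y;A)$ with $I_f(X;Z)$ along a Markov chain. The chain is an illustrative assumption with strictly positive kernels, so that the density ratio is always defined.

```python
import math


def f_information(joint, f):
    """I_f = E_{P_XY} f( P_X(x) P_Y(y) / P_XY(x, y) ) for a strictly positive joint."""
    rows = [sum(r) for r in joint]
    cols = [sum(joint[i][j] for i in range(len(joint)))
            for j in range(len(joint[0]))]
    return sum(joint[i][j] * f(rows[i] * cols[j] / joint[i][j])
               for i in range(len(joint)) for j in range(len(joint[0])))


def neg_log(t):
    """f(t) = -log(t); convex with f(1) = 0, recovers mutual information."""
    return -math.log(t)


def squared_deviation(t):
    """f(t) = (t - 1)^2; convex with f(1) = 0."""
    return (t - 1.0) ** 2


# Markov chain Y -> X -> Z -> A; row k of each kernel conditions on value k.
p_y = [0.4, 0.6]
p_x_y = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
p_z_x = [[0.8, 0.2], [0.5, 0.5], [0.1, 0.9]]
p_a_z = [[0.9, 0.1], [0.2, 0.8]]

p_x = [sum(p_y[y] * p_x_y[y][x] for y in range(2)) for x in range(3)]
joint_xz = [[p_x[x] * p_z_x[x][z] for z in range(2)] for x in range(3)]

p_a_x = [[sum(p_z_x[x][z] * p_a_z[z][a] for z in range(2)) for a in range(2)]
         for x in range(3)]
joint_ya = [[p_y[y] * sum(p_x_y[y][x] * p_a_x[x][a] for x in range(3))
             for a in range(2)]
            for y in range(2)]

for name, f in [("-log", neg_log), ("(t-1)^2", squared_deviation)]:
    print(name, "I_f(Y;A) =", f_information(joint_ya, f),
          "I_f(X;Z) =", f_information(joint_xz, f))
```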





[Figure 3: Rate-Distortion Plots, see text. Plot title: Rate-Distortion Curves. Legend: Brier Loss, Tilted Brier Loss.]